Automatic Multi-Lingual Information Extraction
Abstract
Information extraction (IE) is a rapidly growing field, driven by the explosion of text on the internet. So far, most IE systems focus on English text, most rely on a supervised learning framework that requires a large amount of human labeling effort, and most work only in a narrow domain, which makes them domain dependent. Because of these inherent shortcomings, such systems are difficult to port to other languages and other domains. Besides Western languages such as English, there are many Asian languages that differ greatly from English. In English, words are delimited by white space, so a computer can easily tokenize the input text. Many languages, such as Chinese, Japanese, Thai, and Korean, do not mark boundaries between words, which makes information extraction for those languages difficult. In this thesis, we intend to implement a self-contained, language-independent, automatic IE system. The system is automatic because it uses either an unsupervised learning framework, in which no labeled data is required for training, or a semi-supervised learning framework, in which a small amount of labeled data and a large amount of unlabeled data are used. Specifically, we address named entity recognition and entity relation extraction for Chinese and English, but the system can be easily extended to other languages and other tasks. We implement an unsupervised Chinese word segmenter and a Chinese POS tagger, and we extend maximum entropy models to incorporate unlabeled data for general information extraction.
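The abstract's point about word boundaries can be illustrated with a minimal sketch. It contrasts white-space tokenization, which works for English, with a dictionary-based forward maximum-matching segmenter for Chinese. Maximum matching is a simple stand-in chosen here for illustration; it is not the unsupervised segmenter the thesis describes. The function names, the toy lexicon, and the `max_len` parameter are all assumptions made for this example.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of white space; adequate for space-delimited languages."""
    return text.split()


def max_match_segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching (illustrative stand-in, not the thesis method).

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # English: white space already marks the word boundaries.
    print(whitespace_tokenize("information extraction is useful"))

    # Chinese: the same call returns the whole string as one token.
    sentence = "信息抽取系统"  # "information extraction system"
    print(whitespace_tokenize(sentence))

    # Hypothetical toy lexicon; a real system would need a far larger one,
    # or would learn word boundaries from unlabeled text instead.
    toy_lexicon = {"信息", "抽取", "系统"}
    print(max_match_segment(sentence, toy_lexicon))
```

The dependence on a hand-built lexicon is the kind of manual resource that an unsupervised segmenter is meant to avoid.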
Similar resources
Multi-domain Cross-lingual Information Extraction from Clean and Noisy Texts
We have created a human-annotated, multi-event, cross-lingual corpus of equivalent summaries in Spanish and English to investigate cross-lingual information extraction. The corpus contains, in addition to pairs of equivalent non-translated summaries, automatic translations of each summary produced using an available translation tool. We have developed trainable information extraction systems pe...
Extracting Information for Automatic Indexing of Multimedia Material
This paper discusses our work on information extraction (IE) from multi-lingual, multi-media, multi-genre Language Resources, in a domain where there are many different event types. This work is being carried out in the context of MUMIS, an EU-funded project that aims at the development of basic technology for the creation of a composite index from multiple and multi-lingual sources. Our approa...
A pattern learning-based method for temporal expression extraction and normalization from multi-lingual heterogeneous clinical texts
BACKGROUND: Temporal expression extraction and normalization is a fundamental and essential step in clinical text processing and analysis. Though a variety of commonly used NLP tools are available for medical temporal information extraction, few are satisfactory for multi-lingual heterogeneous clinical texts. METHODS: A novel method called TEER is proposed for both multi-lingual temporal e...
Experiments in Cross Language Query Focused Multi-Document Summarization
The twin challenges of massive information overload via the web and ubiquitous computers present us with an unavoidable task: developing techniques to handle multilingual information robustly and efficiently, with as high quality performance as possible. Previous research activities on multilingual information access systems have studied cross-language information retrieval (CLIR), information ...
Neural Relation Extraction with Multi-lingual Attention
Relation extraction has been widely used for finding unknown relational facts from the plain text. Most existing methods focus on exploiting mono-lingual data for relation extraction, ignoring massive information from the texts in various languages. To address this issue, we introduce a multi-lingual neural relation extraction framework, which employs monolingual attention to utilize the inform...
Enhancing Multi-lingual Information Extraction via Cross-Media Inference and Fusion
We describe a new information fusion approach to integrate facts extracted from cross-media objects (videos and texts) into a coherent common representation including multi-level knowledge (concepts, relations and events). Beyond standard information fusion, we exploited video extraction results and significantly improved text Information Extraction. We further extended our methods to multi-lin...